[Bugfix][ROCm] Fixing trying to import non-existent symbols from libnccl.so#25605
[Bugfix][ROCm] Fixing trying to import non-existent symbols from libnccl.so#25605tlrmchlsmth merged 5 commits intovllm-project:mainfrom
Conversation
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
There was a problem hiding this comment.
Code Review
This pull request aims to fix a crash on ROCm when certain NCCL symbols are not present by silently ignoring them during library initialization. However, the current implementation appears to be flawed. It checks if the function pointer is None, whereas ctypes is expected to raise an AttributeError for missing symbols. This suggests the fix may not work as intended. I have provided a suggestion to correctly handle missing symbols using a try...except block, which should make the fix more robust.
|
I'm not sure if it is better to exclude nccl specific functions like what you did here or include them like: #25608 |
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
That's also possible, only the condition should not be CUDA/ROCm, but ideally, NCCL version. |
| "ncclCommWindowDeregister" | ||
| ]: | ||
| # These symbols require NCCL >= 2.27.03, and having | ||
| # an exception here on ROCm platform is not allowed |
There was a problem hiding this comment.
Can you add a logger.warn_once here that mentions that these symbols failed to import and that users should update their nccl/rccl libraries if they want to use them?
…nabled Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
|
I confirm that it is necessary to be able to run mGPU models in ROCM, it solves the following error: |
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
…ccl.so (#25605) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
…ccl.so (vllm-project#25605) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Fixing an issue from #24532
ncclCommWindowRegister and ncclCommWindowDeregister only exist on nccl >= 2.27.03
On ROCm having an unhandled exception (trying to import symbols in the above part of the code crashes graph capturing due to the
HIP error: operation not permitted when stream is capturingerror.I suppose the least invasive solution would be to swallow the import error on ROCm for these two symbols silently.
cc @Amir-19